3.4 How to generate time-aggregated values - collapse
In addition to aggregating data into a higher unit level1, e.g. from person level into family level (or municipality level), the command collapse
may also be used as a tool for statistical measurements aggregated over a specified time span. In practice, this is the same as aggregating data from event/longitudinal level into person level. Examples may be calculations of a state duration measured over a given time interval, retrieval of status in a given time interval, retrieval of number of occurrences in given states in a given time interval, or summation of values over a given time interval.
This is done on event-organized data sets (see section 2.3.2) through the following command:
collapse (<aggregate measure>) <dataset>, by(<unit-id>)
Type of aggregation is required as input in the parenthesis following
collapse
, and then the name on an event organized dataset. Aggregation type may be as follows:
-
max
maximum value -
min
minimum value -
mean
mean value -
median
median valuei -
count
number of values -
sum
sum of values -
semean
standard error of mean value -
sebinomial
binomial standard error of mean value -
sd
standard deviation -
percent
percentage valid values -
iqr
interquartile range (range between 75th and 25th percentiles)
The option by(<unit-id>)
is used to specify which unit type to
aggregate over. This will usually be individuals, given by the unit
identification number contained by the key variabel PERSONID_1.
Example 1: Calculate the number of times the individuals have changed their marital status during 2000-2005
require no.ssb.fdb:23 as db
create-dataset maritalevent
import-event db/SIVSTANDFDT_SIVSTAND 2000-01-01 to 2005-01-01 as
maritalperiod
collapse (count) maritalperiod, by(PERSONID_1)
rename maritalperiod maritalstates
replace maritalstates = maritalstates - 1
tabulate maritalstates
Example 2: Calculate the number of divorces per individual during 2000-2005
require no.ssb.fdb:12 as db
create-dataset maritalevent
import-event db/SIVSTANDFDT_SIVSTAND 2000-01-01 to 2005-01-01 as
maritalperiod
keep if maritalperiod == '4'
collapse (count) maritalperiod, by(PERSONID_1)
rename maritalperiod divorces
tabulate divorces
Note that the variable maritalperiod initially contains data on marital status (each new record represents a change in marital status). However, through the steps in the examples, the variable is transformed from containing event level data into containing the count
-value measured over the 2000-2005 period for the specific unit level (= individual). Thus, following the collapse
-procedure, the variable maritalperiod will now contain the number of marital statuses measured per individual over the period (example 1) or the number of divorces per individual measured over the same period (= the number of records containing the value '4' which represents the status "divorced") (example 2).
NB! In order to be able to continue working with the aggregated value
generated through collapse
, the dataset needs to be linked with the
other variables placed in the main analysis dataset built through the
import
-procedure (see section 2.3.1). See section 2.8 on how to do this.
Footnotes
-
See section 2.8 ↩